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ABSTRACT 

We propose a robust principal component analysis (PCA) framework for the exploita- 
tion of multi-band photometric measurements in large surveys. Period search results 
are improved using the time series of the first principal component due to its optimized 
signal-to-noise ratio. The presence of correlated excess variations in the multivariate 
time series enables the detection of weaker variability. Furthermore, the direction of 
the largest variance differs for certain types of variable stars. This can be used as an 
efficient attribute for classification. The application of the method to a subsample of 
Sloan Digital Sky Survey Stripe 82 data yielded 132 high-amplitude 5 Scuti variables. 
We found also 129 new RR Lyrae variables, complementary to the catalogue of Sesar] 
|et al.| (j2010|), extending the halo area mapped by Stripe 82 RR Lyrae stars towards 
the Galactic bulge. The sample comprises also 25 multiperiodic or Blazhko RR Lyrae 
stars. 
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During the last decade, wide-area and multi-epoch surveys 
have started to play a major role in astronomical research. 
Developments in astronomical instrumentation and in space 
observation techniques, together with the rapidly growing 
data storage facilities and the broadly available software for 
combining these data provide an enormous wealth of infor- 
mation. The traditional manual procedures must be replaced 
by quick automated methods for pre-processing, selection 
and analysis. 

Analysis of variable stars has benefited greatly from 



data collected by large-scale surveys such as AS AS ( Poj 
manski|[2002l|2003[), OGLE ([Udalski, Kubiak fc Szymanski 



1997[) MACHO ([Alcock et aL||1997| ), or EROS ( [Aubourg 



et al. 1993 Spano et al. 20111. Studies of different, some- 



times rare classes of objects become possible with unprece- 
dentedly large sets of objects. These studies require specific 
pre-processing: a preliminary classification of the observed 
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objects, in order to enable an efficient extraction of homoge- 
neous samples. The first step in this procedure is variability 
detection, providing a set of candidate variable stars. The 
next is characterization of the observed time series by the 
calculation of some numerical summaries of the observed 
time series. These are usually statistical moments (e.g. mean 
magnitude, skewness, kurtosis), derived quantities from pe- 
riod search (e.g. amplitudes and relative phases of harmonic 
components), and some astrophysical parameters such as 
colours. These parameters, called generally attributes in the 
context of classification, are then used to estimate the types 
of the objects. The volume of data requires fast automated 
data mining techniques like, for instance, those proposed by 



Rimoldini et al. 



Eyer fc Blake] ( 2002[ |2005[). Most re c ently. Random Forest 
(Breiman||2001[| Dubath ct al. "20111 [Richards et al.||2011 



20121, multistage Bayesian networks (De- 
bosscher et al.|2007 Sarro et al.|2009[ ) and gradient boosting 
methods (Richards et al.|2011l were tested for this purpose 



with promising results. 

Unfortunately, automatically distributed class labels 
can be less reliable than the results of a careful human in- 
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spection. A more efficient use of the information contained 
in the data can improve on this. The goal of this study is 
to consider a new way of including colour information into 
automated classification procedures. Although variable stars 
with different origin of variability and different physical pa- 
rameters show different amplitudes and light curve patterns 
depending on wavelength during their cycle, this variation 
is not easy to summarize in a concise numerical form, as it is 
a function of phase. We attempt to find summaries of these 
colour variations, and to use it as complementary informa- 
tion to the usual attributes in automated classification of 
multi-band survey data. 

Principal component analysis (PCA, 



Hastie, Tibshirani & Friedman 2009 1 gives precisely such 



Jolhffe 2009 



summaries. It considers the vector of observations in M 
bands at one time as a point in an M-dimensional space, 
and looks for a decomposition of the space of observations 
into directions with maximal variance. The first direction 
is called the first principal component or PCI. The pro- 
jection of any point to the PCI direction is a linear com- 
bination of the simultaneous measurements with constant 
coefficients, which is called PCI score. It provides several 
potential advantages. It improves signal-to-noise ratio when 
searching for periods on the time series of the projections on 
the direction of PCI. It yields a variability criterion based on 
the presence of correlated variations across filters. Also, the 
direction of PCI is characteristic to the origin of variabil- 
ity and to the physical properties of the star, and is useful 
to separate eclipsing binaries from some pulsating variables 
with symmetric light curves such as RR Lyrae type c (here- 
after RRc). We demonstrate these advantages on 5- band 
time series from a flux-limited sample of Stripe 82 objects 
of the Sloan Digital Sky Survey (SDSS). 

However, applying PCA on astronomical time series 
is not straightforward. Usually, the data in different filters 
have different error levels, depending on the measured mag- 
nitudes too. Outliers and non-normality can also strongly 
influence PCA, which is built entirely on the assumption of 
normality. Since usually not all type of errors can be antic- 
ipated and accounted for in advance, the robustness of the 
methods is very desirable: a good automated method should 
be able to function acceptably well even in the presence of 
contaminations from various unknown sources. We propose 
a combination of variance stabilizing transformation with ro- 
bust principal component analysis to minimize the impact 
of these issues. 

After constructing the methodology, we use it to select 
samples of candidate RR Lyrae and high-amplitude 5 Scuti 
(HADS) from SDSS Stripe 82. RR Lyrae are interesting as 
they are Population II halo structure tracers, and obey well- 



determined period-luminosity relations (e.g. Smith 1995 1 



The HADS stars are also pulsating variable stars with spec- 
tral type from late A to early F, occurring in the instabil- 
ity strip on and just above the main sequence below the 
RR Lyrae (e .g. |Breger|p80| [Clement et"aL]p00T| |Pigulski| 



et al. 20061. Their pulsation is in majority radial- mode. 



though for some, there are indications of non-radial modes 
| |Mazur, Krzeminski fc Thompson_2003l|Porettii2003 ). They 
also satisfy period-luminosity relations depending on metal- 



licity and oscillation mode (Nemec, Nemec & Lutz 1994 
|McNamara|p95l pfT] |2000| ), but Population I {6 Scuti) 
and II (SX Phoenicis) objects are mixed in the group, and 



they are affected also by mode identiflcation problems. In 
Stripe 82 data, the conflrmed RR Lyrae stars published by 



Sesar et al. (20101 will be used as a performance test and 



a training set for the PCA-based methodology. For HADS 
stars, we build a training set to obtain a clean sample in 
Stripe 82. Finally, we present a list of 129 new RR Lyrae 
and 132 HADS candidate variables. 

In Section[2| we briefly present SDSS Stripe 82 (York et 



ar]|2000l. Section [3] summarizes the statistical background 



variance stabilizing transformation and principal component 
analysis with its robust version, the tested period search 
variants, and the Random Forest classifier. Section [4] de- 
scribes the results obtained by PCA in variability detection, 
period search, characterisation and classification. A short 
summary and a discussion follows in Section [5] 
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The Sloan Digital Sky Survey Data Release 7 ( |Abazajian| 
et al. 2009 1 provides five-band (u, g, r, i and z) photom- 
etry of more than 11000 deg^ of the sky. The photometric 
errors are around 0.02 mag for 17 < 16 mag, and around 
0.04 mag for <? ~ 21 mag (see Fig. 2 of Sesar et al. 20101. 



Equivalent values for the noisiest u band are 0.02 mag at 
the bright end and 0.05 mag around 20 mag. The 95% 
completeness limits of the images are u = 22.0, g = 22.2, 
r = 22.2, i = 21.3, z = 20.5 (lAbazajian et al.||2004l). One 



of the southern regions. Stripe 82, was observed repeatedly 
during the first phase of SDSS (SDSS-I) and the following 
SDSS-II Supernova Survey (Bramich et al.|2008 Frieman et 



|al.||2008] ). The measurements in the five bands were taken 
quasi-simultaneously, with around 1.2 minute time differ- 
ence between band records. 

Our data set is a ffux-limited subset of ISesar et al.l 
( j2007| ), separated into around 68,000 variable and 200,000 
non-variable objects with median g magnitude brighter than 
20.5 mag. The variable flag used there was based on two 
criteria: (1) for the root mean square scatter in g and r, 
ar > 0.05 and ag > 0.05; (2) for the reduced chi-square in 
r and g, ^r ^ 3 and ^g Js 3, respactively. This selection will 
be referred as 'flagged as variable' or 'flagged non-variable' 
hereafter, and will serve in comparisons with the PCA-based 
detection method. Details about the construction and test- 



ing of the catalogue are described in Ivezic et al. (20071 



Sesar et al. (20101 published 483 conflrmed RR Lyrae vari- 



ables in Stripe 82; our subset of data contains 450 of these. 
This makes Stripe 82 data particularly adapted to test our 
methods, as we can check their performance by direct com- 
parison to previous analyses of the same data. 



3 STATISTICAL TOOLS 

3.1 Principal Component Analysis 

Principal component analysis is a statistical tool developed 
for finding orthogonal directions of maximal variance in 
high-dimensional data sets. It is assumed that directions of 
large variance are of particular interest: in signal processing, 
they may contain the bulk of the information transmitted by 
a signal; in image analysis, they may concisely summarise 
the most characteristic pattern forms; or in classification, 
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they might coincide with the directions that best separate 
some distinct groups of objects. In order to find them, we 
consider the point cloud of multi-filter observations plot- 
ted against each other, regardless of times. The goal is to 
find an orthogonal coordinate system adapted to this point 
cloud: the first axis is fixed so that it points to the direction 
of maximal variance. Next, the points are decomposed into 
projections to this first axis and to its orthogonal subspace. 
Then the second axis is chosen to point into the direction 
of largest variance in the orthogonal subspace. These steps 
are repeated until an orthogonal basis spanning the original 
space is found. Mathematically, we seek for a successive de- 
composition of the space of the observables into orthogonal 
directions to which the projections of the points have the 
highest residual variance. 

Suppose that we observed M sequences 
^11 , X21 , . . . , Xni ; . . . ; XiM , X2M , . . . , Xnm at A'" 
times, with the values Xii,Xi2, ■ ■ ■ ,XiM observed simul- 
taneously. This corresponds to a sequence of observations 
made in an M-dimensional space, where a joint observation 
Xii,Xi2,. ■ ■ ,XiM is represented by a point in the M- 
dimensional state space. If we define the matrix X so that 
its columns are the M sequences, and the rows correspond 
to the different M-dimensional observations, then we can 
write the sample covariance matrix as X.'^X./N, with the 
superscript T denoting transposition. Finding orthogonal 
directions of maximal variance turns out to be equivalent 
to finding the eigendecomposition 



X^X: 



VD^V^. 



Here the columns of V are the eigenvectors, or in other 
terms, the principal directions of X, and D^ is a diagonal 
matrix with non-negative values df, dl, . . . , d%i, representing 
the variances of the projections of the points on the princi- 
pal directions. By convention, the order of the directions is 
such that di > di > ■ ■ • > d\.i , and therefore the first princi- 
pal component corresponds to the direction of the maximal 
variation. The vector of projections of the points on the first 
principal direction can be written as the linear combination 
zi = Xvi, where vi is the first column of the matrix V. 

How to apply this for variable star analysis? For M time 
series of a star consisting of A'^ simultaneously taken points, 
we consider the M-dimensional point cloud of the observa- 
tions, as shown in Fig. IT] We intend to apply principal com- 
ponent analysis in order to find the direction where variabil- 
ity is maximal, so that we can find more easily the period of 
the variable star. However, the application immediately hits 
an obstacle visible in Fig.fl] the different average noise in the 
different bands. Even if a star shows coherent deterministic 
variations across the bands, this can be easily masked by the 
noise in one of the bands (in SDSS, the noisiest bands are 
generally the u and the z band). PCA will pick the direction 
of the noisiest band as shown by the blue ellipsoids in the up- 
per middle and right panels of Fig. [ij and not the direction 
of coherent variations. The remedy is to use the estimated 
errors to scale the observations so that we have unit vari- 
ance in every band. In the case of a non-variable star and 
near-independent errors, this implies that the point cloud 
appears as a sphere. For a variable star, the noise is added 
to a deterministically varying light curve, which causes the 
centre of the sphere to move in the M-dimensional space, 
and instead of a ball, we observe an elongated ellipsoid-like 
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Figure 1. M-, g- and r-magnitudes of a randomly chosen star 
flagged as variable from SDSS Stripe 82. The red ellipsoids are 
levels of equal probability density fitted by non-robust maximum 
likelihood, the blue ones are the same levels from a robust fit by 
the minimum covariance determinant method. 



shape. The first principal direction is the longest axis of this 
shape, providing the largest variability amplitude that can 
be obtained by a linear transformation. 

Assume that the distribution of the observations is 



Xira ~ M{pLm, o"?m)i that Is, the Star is constant with mean 
magnitude ^m in band m, the standard deviation of the 
measurement Xim around the mean is Oim, and the error 
distribution is normal. We have several options to obtain 
unit variance of the noise. 

(i) We can centre and scale the measurements by an es- 
timate of the centre and the pointwise error estimates 
(called here local scaling): 



Y — 



-^in 



A*r7 



for all m = 1, 



,M, 



where jim can be either the mean or the median of 

(ii) We can centre and scale the measurements by an es- 
timate of the centre and the sqare root of the average 
variance of the time series: 



Y — 



M" 



for all m = 1, 



,M, 



0"m 

where fim and a^ can be either the mean (non-robust 
scaling) or the median (robust scaling) of Xim and of 



j2 



(iu) 



can be assumed normal, an analogous matrix transfor- 
mation can be based on the covariance matrix. In this 
paper, we assume that the correlation between the er- 
rors is weak compared to the correlation between bands 
for an RR Lyrae or HADS star ( Scranton et al.|2005 |. 
Considering that the errors depend on true magni- 
tudes, and this effect can be strong for some classes 
of variable stars or very faint objects, we tested also 



4 M. Siiveges et al. 



the variance stabilizing transformation ( Everitt|2002 |. 
Suppose we have a random variable X from a distribu- 
tion with mean /i and variance a^ for which a^ — g{fJ.). 
Define the function / as the solution to f'{x) — 



l9{^ 



^1/2 



Then the transformation Y — f{X) leads to 



a variable with unit variance: Var(y) — 1. In our case, 
we supposed a functional form a^ — g{x) — aexp{bx) 
between the variances (t^„ and centred observed mag- 
nitudes Xim with different coefficients for each band. 
This corresponds to a rough approximation with Pois- 
son noise, which leads to an easily tractable closed- 
form transformation. Fitting this function bandwise 
to observed magnitudes and squared standard errors 
of a sample of variable and non-variable stars resulted 
in the bandwise estimates am and b,n and a transfor- 
mation Yim = "ia-m b^ exp(—bm,Xim/2) for band m. 
Then the point cloud of Yim is centred by removing 
either the mean or the median from it. 



The assumption of normality can be checked by 
quantilo-quantile (Q-Q) plots ( [Chambers et al.|1983|[Cleve-| 
Uan3 1994). After scaling and centring, the ordered sample 
of transformed magnitudes of a star is plotted against the 
corresponding theoretical quantiles of the standard normal 
distribution. If the sample follows indeed the standard nor- 
mal distribution, then the points should be aligned along 
a straight line with intersect and slope 1. If the assump- 
tion of normality is true, but the mean and the variance are 
not and 1, the points still fit on a straight line, but the 
intersect and the slope of the line change to the mean and 
the standard deviation, respectively. If even normality is not 
satisfied, the points deviate systematically from the straight 
line. Figure [2] shows the Q-Q plots of a constant star for all 
variants of scaling, with the expected standard normal line 
pictured in red, and the best-fitting normal distribution in 
blue. The sample is visibly not normal in any of the bands: 
the tails deviate from both lines, indicating heavier tails (the 
absolute values of the observations are larger than what is 
expected from a normal distribution). The blue and the red 
lines have only slightly different slopes in most of the plots, 
suggesting that the scaling did obtain approximately unit 
variance, though the assumption of normality is not valid. 

This implies that the non-robust PCA, built on the 
assumption of normality, is not appropriate for our pur- 
poses. Another important issue that violates this distribu- 
tional condition is the presence of outliers. We can see their 
effect in Fig. [l] a single outlier distorts completely the es- 
timated normal distribution, outlined by the red ellipsoid. 
Robust versions of PCA are given in the statistical litera- 
ture, among which we chose the minimum covariance deter- 
minant method (Rousseeuw 19851. This involves repeated 



random subsampling of the data, leaving out a fixed per- 
cent of the observations each time. Then the covariance ma- 
trix is computed using each subsample, and the one having 
minimal determinant is chosen as the robust estimate. The 
blue ellipsoids in Fig. IT] are the result of the application of 
this method. The percent of left-out data is fixed so that 
the effect of a single outlier is decreased, but that of a few 
consistently located outlying observations is preserved. This 
choice is motivated by two conflicting aims. One is to get rid 
off of the effects of true erroneous observations. The other is 
to preserve that of those that are true representatives of the 



light curve, though they are rare; for instance scarce mea- 
surements of the dips in detached eclipsing binaries. Indeed, 
the blue ellipsoids in Fig. IT] represent better the apparent 
distribution of the bulk of the data. 

Robust PCA yields some directly useful quantities. The 
first are the coefficients vi of the linear combination, which 
characterize the direction of the first principal component. 
It depends on the type of variability, and thus can be used in 
classification. The second are the scores of the observations 
on the first principal component, that is, the time series 
of the linear combinations zi = Xvi. By convention, the 
linear coefficients vi are fixed so that the variance of the 
noise after transformation remains one, so the improvement 
on the signal-to-noise ratio is X]m=i^»"i^™' where Am is 
the signal amplitude in band m after scaling. The third is 
the variance df of zi, which is related to the elongation of 
the point cloud. Its ratio to the total variance X]m=i '^"i 
(called hereafter the variance proportion) gives indication 
about the presence of variability. Finally, the distances of 
the observations from the centre with respect to the robust 
variance-covariance matrix fit indicate outlyingness of the 
observations, and can be used to weight the observations in 
period search. 



3.2 Period search 



We used the generalised Lomb-Scargle method ( Zechmeister 
fc Kiirster[2009 ) to obtain the periodogram of the first prin- 



cipal component, and tested various trimming and weighting 
options on simulated sinusoids with errors imitating SDSS 
error patterns. 

The usual weighting with the inverse squared errors can- 
not be applied for the zi time series, because the scaling and 
the construction of the principal components lead to approx- 
imate unit variance of the noise on zi . Instead, we can base 
a measure of outlyingness on the robust distances, and trim 
or weight the elements of the time series according to this 
measure. The squared distances, r^, in an M-dimensional 
space under the hypothesis of approximate standard normal- 
ity should follow a Xm distribution. Similarly to the princi- 
ple of Q-Q plots, we can determine what value is expected 
for the i-th largest distance, r?j), among a sample of size A'' 
according to the Xm distribution: this will be the i/{N + 1) 
quantile, CM,i/{N+i), of the Xm distribution (TV -f 1 is taken 
instead of N to avoid exactly or 1 values). Such a xi Q-Q 
plot of the robust distances is shown in Fig. |3] for a variable 
star. We checked various alternatives of trimming thresholds 
and weight definitions based on the difference between the 
ith largest observed distance with simulations. 

The goal of the simulations is to reproduce the SDSS 
error levels, error dependence on magnitude, outliers and 
observational cadences. In order to obtain this, we used the 
following procedure: 

1. From 2000 randomly selected objects from the 68,000 
Stripe 82 objects flagged as variable, we extracted the 
sets of five amplitudes based on the excess root mean 
square variability, the sets of five median magnitudes and 
the u-band observational times. 

2. We generated 200 sine functions with random frequen- 
cies in the interval [0, 20] day~^, with random phases, and 
randomly selected sets of five amplitudes and of median 
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Figure 2. Quantile-quantile plots of a non-variable star from Stripe 82. The columns are the five bands u,g,r,i,z. The rows correspond to 
different scaling methods: first row from top, local scaling; second row, non-robust scaling; third row, robust scaling; fourth row, variance 
stabilizing transformation. The red line corresponds to a standard normal distribution, the blue line to a robustly fitted best-fit normal 
distribution to the data. 



magnitudes from those extracted from our sample. Each 
sine function was muhiplied by all five elements of one 
amplitude set, and one set of median magnitudes was 
added to it. We obtained thus 200 pure five-band hght 
curves, each with coherent brightness variations in the 
bands. We sampled these sine waves with time cadences 
extracted from Stripe 82 in the previous step. 
1. We added noise to the hght curves. We divided the full 
magnitude range of the 2000 stars in each filter into 0.5 
mag wide bins, and grouped the error bars according to 
the magnitude value with which it occurred. We obtained 
thus a sample of all error bars, in magnitude bins, for 
all bands. Then for each simulated magnitude values of 
the sinusoidal hght curves in each band, we randomly se- 
lected an error bar value e from the bin corresponding 
to the simulated magnitude value in that band. We gen- 
erated a random number from the jV{0,e^) distribution. 



and added this value as the realised noise to the simu- 
lated magnitude. Finally, we joined e as the error bar to 
the simulated series. 
4. To simulate outliers, we changed some observations to 
fainter values. The differences were random values from 
Af{0, (6e)^), in a random number of filters simultaneously. 
The number of outliers followed a Poisson distribution 
with mean equal to 0.005. 



Visual inspection and Q-Q plots showed that the ob- 
served and simulated hght curves are very similar. The sim- 
ulations were then used to investigate the performance of 
different combinations of scaling, weighting or trimming and 
period search. We found that the combination leading to the 
highest frequency recovery rate was the generalised Lomb- 
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Figure 3. Robust distances from the center of the point cloud for 
the 5-diinensional observations of a star, on a square root scale for 
better visibility. The red line indicates the exact Xj distribution. 
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Figure 4. The logarithm of the difference between the true and 
the found frequency in day~^ for sinusoidal light curve simula- 
tions with approximate SDSS error distributions on g-band (top 
panel) and on PCI (bottom panel). The solid red and dashed blue 
lines are the daily and the yearly aliases, respectively. 



Scargle method with weights defined by 



W 



{1>I 



C5,i/{N+\) 



|}] 



-1/2 



(1) 



-1/2 



withM^ = X;^i [max{l,|rj^ - C5,i/(iv-i-i)|}] "'",apphedto 
either the zi time series derived from variance stabihzed 
observations or to observations scaled with robust estimators 
of the mean and standard deviation. 

For comparison, the generahsed Lomb-Scargle method 
with the usual weights based on error bars was also applied 
to detect periodicity on the single g band, which generally 
had the best signal-to-noise ratio. The results of the best zi 
and the single-band analysis are compared in Fig. |4] The 



logarithm of the difference between the found and the true 
frequency is shown versus the number of observations in 
the time series. Period search on zi outperforms the single- 
band analysis. While the single-band analysis finds a yearly 
or daily alias in 31 cases, the analysis of zi reduces this 
number to only 8. 

Based on these simulation results, in the analysis of the 
variable stars in Stripe 82 we applied the generalised Lomb- 
Scargle method with weights defined by (fTl). For some vari- 
able stars, we also performed a multifrequency analysis. The 
folded zi light curves were fitted with B-splines, restricting 
the smoothing parameter to provide a reasonable degree of 
smoothing (R procedure smooth, spline with the constraint 
0.5 ^ spar ^ 1 ( R Development Core Team||2010 |, and se- 
lecting its optimal value by leave-one-out cross-validation. 
The smoothed light curves zi were inspected visually, and 
the rare over- or undersmoothed cases were corrected by a 
manual selection of the smoothing parameter. The frequency 
analysis of the residuals was not performed on light curves 
where the equivalent degrees of freedom of the smoothing 
was high with respect to the number of data points, since 
this left too few residual degrees of freedom to obtain mean- 
ingful harmonic fits in the residual frequency spectrum. For 
the cases with enough residual degrees of freedom, the resid- 
uals r = zi — zi were extracted from the B-spline model, and 
the same period search algorithm with the same weights as 
for zi was performed on their time series. The significance 
of eventual peaks in the residual periodogram was assessed 



using a combination of non-parametric bootstrap ( Davison 
fc Hinkley [2009 1 and extreme- value methods ( Coles|200T I 



3.3 Random Forest 

Supervised classification methods estimate the class (usu- 
ally a discrete variable like 1, 2, . . . , L or RRAB, RRC, EA, 
EB, ...) for an object of unknown class based on some at- 
tributes (for example period, amplitude, colour, etc.), by fit- 
ting a model to a set of objects with measured attributes and 
known classes (the training set), and then using this for pre- 
diction. Random Forest (Breiman|2001[[Hastie, Tibshirani^ 
Friedman 2009 1 is a popular method which works excellently 



in a very wide range of data mining problems. It consists of 
building a collection ('forest') of classification trees on the 
training set, then passing any new instance down on all trees 
and obtaining the class estimate by the majority vote of the 
forest. 

The training ('growing of the forest') begins with a se- 
lection of a large number of bootstrap samples from the 
training set. Then a classification tree is built separately for 
each sample with binary splits. First a relatively small sub- 
set of attributes is selected randomly. For each of these, the 
split-point is computed that separates the given bootstrap 
sample with the least mixing of classes by some criterion 
(for example, one of the subsets contains only classes A and 
B, whereas the other mostly C and D). Then the tree is split 
according to the attribute for which this separation is the 
cleanest. For the next step, both subsets (called nodes) are 
further split in the same way as was done in the first step: 
from a small random subset of attributes, the one is selected 
that splits best the node in question. The procedure is re- 
peated until each final node is either perfectly clean or has 
a pre-defined minimal size. Then, using another bootstrap 
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sample from the training set, a new tree is built. As a re- 
sult, a large forest of many trees emerges. To classify a new 
object, the prediction of the class by each tree in the forest 
is calculated, and the class with majority vote is accepted 
as the estimate. An estimate of the probabilities to belong 
to each of the classes can be obtained by calculating the 
proportion of votes for each class. 

The main advantage of this procedure is that it obtains 
the class estimate as an average of many individual esti- 
mates by the trees. These estimates are unbiased, but have 
high variance. Averaging preserves unbiasedness and reduces 
variance, and this variance reduction is larger, if the trees are 
less correlated ( [Hastie, Tibshirani fc Friedman||2009[ ). Ran- 
dom Forest achieves de-correlation by applying two tricks: 
first, it uses only bootstrap samples from the training sets, so 
the basic data set driving the learning process is not iden- 
tical for each tree; and second, it uses the best of only a 
random subset of attributes, not of all attributes. The con- 
sequent higher variance of the individual trees is more than 
compensated by the decrease of the variance of the average 
because of the nearly non-correlated trees. 

There are a few tuning parameters in the Random For- 
est procedure: the number of trees in the forest, the number 
of random attributes at the nodes, and the final node size. 
For most classification problems, growing several hundred 
trees is enough, and adding more trees does not improve the 
prediction accuracy. With respect to the final node size, op- 
timal results are achieved by maximally growing the trees 
(that is, the minimal final node size is 1), but it is custom- 
ary in larger problems with many instances and attributes 
to grow the trees only to a somewhat larger final node size 
to speed up the process without great loss of accuracy. The 
most influential tuning parameter is in general the number of 
attributes from which the best split is chosen at the nodes, 
though Random Forest is only weakly sensitive to this as 
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well. Breiman (20011 proposes to choose the largest integer 



k such that k < y/K, where K is the number of attributes. 



4 APPLICATION TO DATA 

4.1 Principal Component Analysis 

We considered only stars that have at least 10 complete 
ugriz observations from our data set. As discussed in Sec- 
tion |3.1| scaling and centring is necessary before applying ro- 
bust PCA. We tested all four methods (local scaling, robust 
and non-robust average scaling and variance stabilization) 
outlined there for variability detection, characterization and 
period search. The requirements of these three tasks are not 
the same, and thus different scaling methods perform bet- 
ter in each one, but inspection of the results suggested that 
variance stabilization yields the best overall performance. 
We present here only the results based on this transforma- 
tion. 



4.1.1 Variability detection based on zi variance 

After scaling and centring the data, we expect to see a spher- 
ical point cloud, similar to a multivariate standard normal 
sample, if the errors are nearly non-correlated and the star 
is constant. Detection of variability is thus equivalent to 
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Figure 5. The proportion of the variance of the first principal 
component to the total variance versus the number of observa- 
tions for the RR Lyrae sample of |Sesar et al.| l |2010| l (empty blue 
circles), and for 500-500 variable and non- variable random ob- 
jects (black dots and grey triangles, respectively). The simulated 
0.9999 quantiles of the simulated distributions are added as a red 
line. 



detecting excess variance in the point cloud of the scaled 
observations as compared to the variance of the first prin- 
cipal component of a 5-dimensional standard normal point 
cloud. Samples from even a spherically symmetric distribu- 
tion show stochastic distortions from the perfect sphere, and 
the smaller the number TV of the points in the cloud is, the 
stronger this distortion is. As PCA selects the direction of 
maximal spread in the data, we cannot expect d\ to be ex- 
actly 1. The distribution of df as a function of A'^ is the 
easiest to obtain with simulations. The zero hypothesis of 
non-variability of a source may be tested at a given confi- 
dence level Q by comparing the observed df of the source to 
the quantile c\-a of the simulated distribution: d\ > ci_q 
rejects the hypothesis of non-variability of the source at the 
level a. 

As a PCA-based selection criterion, we use the propor- 
tion of df to the total variance. Figure [s] sho ws its com- 
parison to the variable selection of Sesar et al. (20101. Five 



hundred objects that were flagged variable (black dots) and 
500 flagged non-variable (flUed grey triangles) are plotted. 



together with the RR Lyrae sample of Sesar et al. (20101 



(empty blue circles). The 0.9999 quantile derived from the 
simulations is denoted by the red line. For sources that are 
above this line, non-variability is rejected at the level of 
0.0001. The left panel shows the proportion of d\ to the 
total variance. Most of the objects flagged variable (black 
dots) are above the red line, meaning that the variance 
proportion criterion and the classical cuts select approxi- 
mately the same variable sample. Some difference neverthe- 
less can be observed. Among the stars flagged non-variable, 
the PCA-based criterion found some variables: the grey tri- 
angles above the red line. The significant variance excess 
seems to be due to two reasons. When the bands are weakly 
correlated, the excess might be the consequence of either 
microvariability, or correlated and underestimated errors 
across bands ( Scranton et al.|2005 l, so to detect microvari- 
ability in Stripe 82, a de-correlated version of scaling should 
be used. The other is that the variance stabilization did not 
succeed to obtain unit variance in one band. In this case, the 
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point cloud is elongated along one axis. Correspondingly, the 
coefficients vi of the first principal component contain usu- 
ally one value close to 1 at the band which is the origin of 
the excess variance and a value close to for all others. The 
characteristic proffie of vi helps to recognize these cases. 

Conversely, the black dots below the red line in the left 
panel of Fig. [5] represent points that are flagged variable but 
found non-variable by PCA. These seem to belong to two 
groups: one for which the flag seems to originate from an 
outlier, and another that appears ball-like on the pairwise 
scatterplots like Fig. IT] but not with unit variance. The vari- 
able flag from the classical analysis in the latter case can be 
due to under-estimated but nearly non-correlated errors in 



all bands Scranton et al. (20051. If the under-estimation is 



of similar order in every band, then the PCA-based criterion 
is less biased than traditional cuts, as it is defined as a pro- 
portion. Moreover, vi would again contain one value close 
to 1 and all the others near-zero, so it is possible to filter 
out these cases based on the vi profile. In general, this form 
of vi can be used to recognize and filter the cases where the 
excess scatter is due to underestimated errors in one or a 
few bands rather than true variability of the source. 

4-1.2 Period search on PCI time series 

For the majority of the investigated variable objects, period 
search on the SDSS time series suffers from aliasing prob- 
lems. The daily and yearly observational patterns (see e.g. 
Sesar et al. 2007 2010 1 create aliases with complex struc- 



tures, displaying combinations of the variability frequency 
with the 1/day and 1/year frequencies with comparable am- 
plitudes. According to the simulation results of Section [3. 2[ 
using the time series of zi with weights defined by lllj im- 
proves on period search on g-band. We applied this proce- 
dure, complemented by a non-linear optimization to find the 
exact value of the frequency, to a small random sample of 
2000 stars fiagged variable, which contained 36 known RR 
Lyraes. The published periods of these RR Lyraes, as de- 



scribed in Sesar et al. (20101, were determined by a visual 
inspection of light curves folded with the five best periods 
returned by the SuperSmoother algorithm ( Reimanii||1994 1 
restricted to the [0.2, 1] day interval. For 30 out of the 36 
RR Lyrae, the result from our algorithm coincided with the 
true frequency. In the 6 other cases, daily alias periods were 
found. This suggests that we can expect good period search 
results from a simple automatic procedure even in the ab- 
sence of visual inspection. 

We applied the weighted period search method (without 
the nonlinear optimization) on the full variable star set. The 
dominant frequency was used as an attribute in the classi- 
fication procedures, presented in Section [4. 2[ The classifica- 
tion selected a sample of 317 candidate RR Lyrae and HADS 
stars. For these, we modelled the folded zi light curve, com- 
puted the residuals r = Zi — Zi, and performed another 
period search on r as described in Section |3.2[ The found 
likely multi-period objects are discussed in Sections |4.2.2| 
and l4T3l 

4-1.3 Characterisation of variability with PCI spectrum 

RR Lyrae variables show a characteristic wavelength- 
dependent amplitude pattern: their variability is stronger at 



shorter optical wavelengths than at the red end of the spec- 
trum ( Smith|19"95 1 . If we could observe RR Lyrae stars with 
SDSS filters with equal errors in all bands, we would see the 
first principal direction tilted towards the blue rather than 
the red bands, and correspondingly, larger coefficients vi for 
blue than for red wavelengths. However, the error patterns 
distort this simple picture. We must scale the bands in order 
to get rid off of the effect of different average error levels, 
and the downscaling will be stronger in bands with larger 
errors, most notably in u and z- As a result, variability am- 
plitudes get downscaled as well. Thus, the typical RR Lyrae 
vi pattern (the PCI spectrum) takes a characteristic shape. 
It is composed of, on the one hand, the amplitude pattern 
of pulsating variables governed by the physical parameters 
and pulsation mode, and on the other, of the scaling pat- 
terns of the survey. Typical PCI spectra of identified RRab 
and RRc stars are shown with black lines in the middle panel 
of Fig.|6] They exhibit small coefficients u^i and v^i for the 
u- and 2-band contributions, and the highest value is Vgi at 
the g-band, which has high variability amplitude and low 
errors. 

HADS variables are in some aspects similar to the RR 



Lyraes (see e.g. Breger ( 1980 \ ; [McNamara (19951; Petersen 



fc Christensen-DalgaardI ( |1996[ ); |McNamara| ( |1997[); a light 
curve is shown in the bottom left panel of Fig. I6|). Their 
effective temperature is roughly in the same range, and al- 
though their pulsation may be more complex than that of 
RR Lyraes, they are mainly radial-mode pulsators. They 
show similar patterns across the wavelengths with decreas- 



ing amplitudes from blue to red wavelengths ( Pigulski et al 



2006 1, so we can expect their PCI spectrum to be similar to 



that of the RR Lyraes. The middle panel of Fig. [6] presents 
the PCI spectrum of several variables that were found to 
have periods, colours and folded light curves characteristic 
to HADS variables. These are plotted in blue, superposed 
on the lines of the RR Lyrae stars. 

For some other types of variability, we can expect differ- 
ent PCI spectra. Eclipsing binaries that are composed of two 
components of the same age and similar masses have similar 
colours, and therefore show only very weak colour changes 
and an almost-equal contribution of all bands to the light 
variation. This results in a fiatter PCI spectrum. Combined 
with the specific error pattern of SDSS, this yields a profile 
that is low at the noisy u and z bands and have a higher 
plateau at g, r and i. Several PCI spectra of this type are 
also shown in the middle panel of Fig. [6] these objects have 
clear eclipsing binary- type folded light curves. 

Figure |6] shows the use of the PCI spectrum for dis- 
criminating certain types. In the two panels on the right- 
hand side, we show the light curve of an RRc star from the 
confirmed sample of [Sesar et al.| ( |2010[ l (upper right panel) 
and an eclipsing binary candidate with only slightly different 
minima and near-sinusoidal light variation (bottom right). 
This eclipsing binary, when folded by half of the period, 
shows a folded light curve very similar to that of an RRc 
variable, as the tiny difference in the depths of the minima 
is masked by the error bars. If such eclipsing binaries fall 
furthermore close to the colour-colour region of RRc stars, 
the automated classification is difficult. However, the PCI 
spectrum is approximately fiat-topped for a geometric ori- 
gin, whereas it has a peak at g for a pulsating RRc. The 
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Figure 6. Examples of folded light curves (left top: RRab, left bottom: HADS, right top: RRc, right bottom: eclipsing binary) and some 
PCI spectra (vi) of the classes RR Lyrae (black lines), HADS (blue) and eclipsing binary (grey). There is a marked difference between 
the two pulsating types and the eclipsing binaries. 



inclusion of the first principal direction vi can thus be help- 
ful in an automated classification procedure. 



4.2 Random Forest classification 

4-2.1 Preliminary selection 

As our goal is to extract only a few variable types of inter- 
est from the SDSS Stripe 82 sample, we define our classes 
as RRAB, RRC, HADS and O, corresponding to the four 
types RRab, RRc, HADS and 'other'. Supervised classifi- 
cation needs a set of objects with known types and well- 
measured attributes to train the classifier. For RRab and 



RRc type variables, the sample of Sesar et al. (20101 pro- 



vides such a set. For HADS stars, we do not have a confirmed 
database. However, they form a relatively well-defined, sep- 
arable class of variable stars, so we can select visually a 
sample of plausible HADS candidates for the training set. 
Class O, on the other hand, is broad enough to elude all 
concise, easy-to-implement definition. Our only requirement 
can be precisely this broadness: class O in the training set 
should contain representatives of all alternative types of vari- 
ables mixed together, except for RR Lyrae or HADS stars. 
A reasonable criterion is therefore the separability of this 
class from HADS, RRAB and RRC. Thus, to obtain a clean 
training set, we apply Random Forest classification on the 
visually selected sample, then we remove all HADS and O 
objects that were misclassified, and we iterate these steps, 
until we reach a sample in which these two classes are rec- 
ognized cleanly. As the RRAB and RRC classes are already 
confirmed, we removed only the O objects that were classi- 
fied as RRAB or RRC, but not the RRAB and RRC stars 
that were misclassified as O. 

For the visual selection of HADS training sample, we 
have expressed all possible selection criteria in the frame- 
work of PCA. The extraction of variable objects was per- 
formed with the aid of the variance proportion of the first 



principal component. To reflect the astrophysical properties 
of the sought stars, we require them also to show the char- 
acteristic colour changes during the cycle. In the language 
of PCA, this corresponds to have broadly the PCI spectrum 
demonstrated by the blue lines in the middle panel of Fig. 
[6] Furthermore, they must have admissible u — g and g — r 
colours. Also, the dominant frequency F^.^ of the time series 
of the first principal component zi must correspond to the 
HADS frequency range. In summary, the selection criteria 
are as follows. 

(i) PCI has a variance proportion higher than the 
0.9999-quantile of that of a 5-variate standard normal 
point cloud with the same number of observations. 

(ii) The PCI spectrum vi = {vui,Vgi,Vri,Vii,V2i)'^ sat- 
isfies the following cuts: < Vui < 0.8, 0.45 < Vgi ^ 
1, 0.3 < Uri < 0.75, 0.15 < Vii < 0.6, < w^i < 0.5. 

(iii) The (extinction-corrected) median colours are in the 
region 0.7 < u — g < 1.4 and —0.2 < 3 — r < 0.4 of the 
u — g, g — r diagram. 

(iv) F^, > 3/day. 

All cuts are given very broadly, so the resulting se- 
lection contains a majority of contaminating objects such 
as eclipsing binaries with flat PCI spectra, quasars, main- 
sequence variables and even RRc stars. Visual inspection 
of the (/-band light curves resulted in 117 plausible HADS 
candidates. The training set for the first cleaning iteration 
consisted of these as class HADS; the RRab and RRc vari- 



ables of Sesar et al. (20101 as classes RRAB (379 objects) 
and RRC (104 objects), respectively; and a random selec- 
tion of 2400 other (O) variables from the remaining variable 
set, yielding a training set of 3000 objects in total. For all 
objects we calculated the following attribute list: 

• the median (apparent) magnitudes in all bands, cor- 
rected for interstellar extinction using the dust map of 



Schlegel, Finkbeiner fc Davis ( 1998 1 
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Figure 7. Attribute ranking by mean accuracy decrease when classifying without the attribute in question. The components vig and 
v\i of the PCI spectrum are highly ranked, comparable to the frequency, r — i and g — r colours. 



• the median (extinction-corrected) colours u — g, g — r, 
r — i and i ~ z; 

• the interquartile range IQR„ = q„{0.75) — g„(0.25), . . . 
of every band and of the first principal component scores 
211, . . . , ziT as a percentile-based estimate of the light curve 
range, where qx{p) denotes the empirical p-quantile of the 
time series x (x = u, g, r, i, z and PCI); 

• percentile estimate S^ — [gu(0.9) + ?„(0.1) — gu(0.5)] 
[g„(0.9) — 5„(0.1)]~^, ... of the hght curve skewness for all 
bands, in addition to that of the first principal component 
scores; 

• the PCI spectrum Vi = (u„i,«5i,t;r-i,Wii, ^zi)^; 

• the dominant frequency F^.^ found by the period search 
method described in Section [3^ 



Selection of a small but sufficient attribute set is never- 
theless necessary: the performance of most data mining pro- 
cedures is sensitive to the presence of noise-like attributes. 
The importance of any attribute can be measured by Ran- 
dom Forest by for example calculating the average accuracy 
loss on trees generated without the use of the attribute in 
question. The importance plot for all attributes in Fig. [7] 
indicates that the best two attributes are the frequency and 
the p-band component Vg\ of the first principal direction. 
They are followed by the r — i and g — r median colours, 
then another two PCA-based attributes, Vi\ and the IQR of 
the zi time series. This latter proves to be more important 
than the g-band IQR. The relatively low accuracy loss due 
to omission of colours is a consequence of our preselection 
of colour range in the u — g, g — r plane and of the cor- 
relation between colours. According to a procedure similar 



to that apphed in Sections 4.2-4.4 of Dubath et al. (20111, 



the most performant attribute set is the 11 top-ranking at- 



tributes: -Fi 



zi,Vgi,r -i,g- 



,Vii,lQR^ 



1 ^a 



, i^z 



and IQR„. This is what we used for the following cleaning 
of the training set and the extraction of our set of candidate 
RR Lyrae and HADS stars. 

The next stage is the cleaning of the training set, to 
obtain clearly separated HADS and O classes. In the first 
iterative step of this. Random Forest was trained on the 



3000 objects using the attribute list presented above. At 
this stage, the set O may contain unrecognized RR Lyrae 
and HADS stars. We assume that this contamination is low 
enough not to bias strongly the automated classification re- 
sults for class O in the first run, as these are relatively rare 
compared to all other types taken together. An overall er- 
ror rate less than 1% in the first iteration confirmed this 
assumption. After each run of Random Forest, we removed 
the objects from type O and HADS that were misclassified 
(keeping all RR Lyrae regardless of predicted type, since 
these are confirmed variables), and repeated classification. 
After 3 iterations, we ended up with 108 HADS and 2372 O 
stars, which were separated perfectly from each other and 
from the RR Lyrae classes. The only confusion arose from 
10 RRAB and RRC stars that were classified as type O, 
and one RRAB classified as RRC. This procedure biases the 
subsequent extraction of HADS, RRAB and RRC towards 
purity against completeness, as we dropped all cases that 
might represent unusual specimens of classes, and the final 
selection does not fully take into account eventual real simi- 
larities between stars belonging to distinct types. The choice 
between purity and completeness is certainly subjective, and 
depends on the purpose of the study. 



Re-processing the whole variable collection by Random 
Forest using the resulting training set gave 317 candidates, 
163 HADS, 82 RRC and 72 RRAB stars. Multifrequency 
analysis as described in Section [3.2| was performed on them. 
The final visual check was done with the help of 'portraits' 



of the stars: summary plots (Figures Al A7 1 that contained 
the most important information that could have been ex- 
tracted from the data, namely, zi and residual frequency 
spectra, folded zi, g, g ~ i and residual light curves, the 
raw observed time series, colour-colour, colour-period and 
period-amplitude diagrams, and PCI spectrum. These plots 
are presented in the Appendix. The visually selected 132 
HADS, 68 RRab, 36 RRc and 25 multiperiodic or peculiar 
RR candidates are discussed in the next two sections. 
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Table 1. RR Lyrae candidates in Stripe 82. ID: object identifier; R.A.: right ascension (deg); Decl.: declination (deg); Mmedv^ median 
magnitudes corrected for the ISM extinction; -Fzj: pulsation frequency determined from generalised least squares period search on PCI; 
Au,...: amplitudes from a B-splines fit to the folded light curves. The complete table is available online. 

ID R.A. Dccl. Type ^tmed Qmed ^med ^med ^med ^zi ^u ^g ^r ^i ^z 



510582 


-11.901668 


-0.947130 


RRAB 


20.22 


19.14 


18.90 


18.79 


18.79 


1.24630 


0.20 


0.19 


0.13 


0.12 


0.10 


651051 


19.017164 


1.262702 


RRAB 


18.27 


17.07 


16.80 


16.72 


16.69 


1.77625 


0.84 


0.97 


0.63 


0.50 


0.48 


1013845 


23.603725 


-0.592160 


RRAB 


17.32 


16.11 


15.87 


15.77 


15.77 


1.36722 


0.78 


0.75 


0.67 


0.55 


0.43 


1918041 


-31.227027 


-0.703085 


RRC 


20.83 


19.69 


19.62 


19.61 


19.69 


2.65171 


0.34 


0.44 


0.26 


0.20 


0.19 


2654711 


-41.834262 


-0.994213 


RRC 


17.66 


16.48 


16.54 


16.62 


16.69 


4.88670 


0.16 


0.19 


0.14 


0.11 


0.10 


2725572 


-40.845223 


-0.719055 


RRAB 


20.15 


19.03 


18.74 


18.69 


18.69 


1.77540 


1.40 


0.31 


0.71 


0.63 


0.49 


3352291 


-46.471817 


1.260386 


RRAB 


15.85 


14.67 


14.41 


14.33 


14.34 


1.95774 


1.18 


1.20 


0.86 


0.69 


0.63 



Table 2. Best-fitting templates of jSesar et al.| ( |2010^ for the new RR Lyrae candidates in Stripe 82. ID: object identifier; R.A.: right 
ascension (deg); Decl.: declination (deg); d: heliocentric distance for stars included in Fig. 19] {V): mean dereddened V-band magnitude, 
determined from a best-fit V-hanA template synthetized from the best-fitting g and g templates, A'^,...: amplitudes; 0g: epoch of 
maximum brightness; uq: the magnitude at the epoch of maximum brightness corrected for the ISM extinction (the last three determined 
from the best-fit template); T„,...: best-fit template identifier number as given in |Sesar et al.| l |2010[ |. The asterisks indicate slightly 
underestimated values, d and {V) were computed only for those RRab variables that are plotted on Fig.^ and are therefore missing for 
some stars. The complete table is available online. 



ID 


R.A. 


Dccl. 


d 


{V) 


Period 


A'„ 


4-0 


no 


Tu 


A'g ... 


510582 


-11.901668 


-0.947130 






0.802373 


0.194 


54358.25 


20.113 


104 


0.187 


651051 


19.017164 


1.262702 


17.256 


16.785 


0.562973 


0.604* 


53639.39 


17.766 


100 


0.663* 


1013845 


23.603725 


-0.592160 


11.322 


15.870 


0.731397 


0.687* 


54029.29 


16.816 


101 


0.659* 


1918041 


-31.227027 


-0.703085 






0.377115 


0.375 


54012.05 


20.612 





0.416 


2654711 


-41.834262 


-0.994213 






0.204637 


0.161 


53675.09 


17.567 





0.167 


2725572 


319.154777 


-0.719055 


42.543 


18.744 


0.563272 


0.626 


53625.21 


19.650 


102 


0.666* 


3352291 


313.528183 


1.260386 


5.600 


14.341 


0.510786 


0.559* 


53704.09 


15.349 


100 


0.598* 



D 3874813, P = 0.63, A=0.69, tmpl=100, Q=5.91 




Figure 8. g-band template fit for star 3874813, using the tem- 
plates of |Sesar et al.| (J20Tol , an example that cannot be fitted well 
by these. For comparison, see the second panel from top in the 
middle column of Fig. |A1[ which shows the same star fitted with 
a B-splinc. 



4-2.2 Candidate RR Lyrae variables 



The list of the 68 RRab and 36 RRc type objects is given in 
Table IT] containing the identifier, coordinates, median mag- 




Figure 9. Heliocentric map of the Stripe 82 RRab stars, based on 
average halo metallicity value. The radial axis is the heliocentric 
distance in kpc, the angle is the equatorial right ascension. Blue 
dots: the sample of |Sesar et al.| l [2010] |, red dots: the new candidate 
RRab. 



nitudes corrected for interstellar extinction, the frequency 
and the bandwise amplitude estimates derived from a B- 
splines fit to the folded light curve. A summary plot for one 
of them is presented in Fig. |A1[ For the ease of eventual 
further analysis and in order to present this addition to the 
sample of |Sesar et al.| ( |2010[ | in a coherent way, we fitted 
this new set of stars with the templates published there, 
and listed the template amplitude, the epoch of maximum 
brightness, the magnitude at the epoch of maximum bright- 
ness and the identifier of the best template in Table[2] There 
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Table 3. Double-mode RR Lyrae candidates in Stripe 82. Notation is the same as for Tablell] _Fr: secondary frequency, Ratio: ratio of 
fundamental to first overtone frequency, Py-. P- value of secondary frequency. 



ID 


R.A. 


Docl. 


^m,!d 


a-mad 


J-mcd 


»mcd 


^med 


P-1 


Fr 


Ratio 


Pr 


Au 


Ag 


Ar 


Ai 


A^ 


1059995 


-23.605640 


-0.056749 


21.02 


19.94 


19.75 


19.75 


19.78 


2.48876 


1.85620 


0.74583 


0.000 


0.19 


0.36 


0.27 


0.22 


0.09 


1346981 


25.772163 


1.097024 


18.20 


17.09 


17.03 


17.03 


17.04 


2.82707 


2.10280 


0.74381 


0.000 


0.40 


0.43 


0.31 


0.23 


0.20 


1638185 


30.812051 


1.205761 


18.05 


16.93 


16.80 


16.80 


16.80 


2.84693 


2.11660 


0.74347 


0.000 


0.48 


0.49 


0.37 


0.31 


0.24 


1875049 


-30.900196 


0.941754 


19.66 


18.57 


18.50 


18.48 


18.48 


2.41850 


1.80320 


0.74558 


0.000 


0.20 


0.35 


0.24 


0.19 


0.22 


2249641 


-39.804740 


0.210125 


17.47 


16.38 


16.31 


16.32 


16.34 


2.78830 


2.07536 


0.74431 


0.000 


0.45 


0.48 


0.33 


0.26 


0.22 


2662389 


-44.787449 


0.404872 


18.22 


17.13 


17.06 


17.08 


17.11 


2.84001 


2.10780 


0.74218 


0.000 


0.38 


0.44 


0.30 


0.24 


0.21 


2740388 


-44.211467 


-1.202920 


17.61 


16.54 


16.47 


16.45 


16.50 


2.76911 


2.06080 


0.74421 


0.000 


0.45 


0.49 


0.34 


0.28 


0.21 


3091639 


48.387957 


0.715216 


19.33 


18.24 


18.14 


18.16 


18.19 


2.82369 


2.09916 


0.74341 


0.000 


0.44 


0.55 


0.36 


0.29 


0.23 


3182847 


-48.344716 


0.330749 


20.53 


19.42 


19.33 


19.36 


19.40 


2.77071 


2.06144 


0.74401 


0.000 


0.38 


0.50 


0.39 


0.28 


0.29 


3524879 


-49.385512 


-0.480494 


17.52 


16.37 


16.28 


16.27 


16.29 


2.42901 


1.81232 


0.74612 


0.000 


0.39 


0.47 


0.32 


0.26 


0.21 


4185977 


-52.885267 


-0.892139 


15.96 


14.84 


14.71 


14.70 


14.75 


2.64865 


1.97572 


0.74593 


0.000 


0.48 


0.53 


0.38 


0.30 


0.26 


5631911 


-57.845165 


-0.839656 


19.85 


18.74 


18.66 


18.57 


18.63 


2.43680 


1.81908 


0.74650 


0.000 


0.49 


0.51 


0.34 


0.27 


0.26 


1149344 


-21.758206 


0.077395 


20.22 


19.16 


19.03 


18.99 


18.99 


2.40371 


2.79388 


0.86035 


0.000 


0.32 


0.37 


0.25 


0.18 


0.18 


1283137 


27.742275 


-0.847584 


19.08 


17.92 


17.83 


17.82 


17.84 


1.75063 


2.04704 


0.85520 


0.000 


0.40 


0.48 


0.32 


0.24 


0.22 



Table 4. RR Lyrae candidates with closely spaced multiple frequencies or indicating period, phase or amplitude shifts in Stripe 82. 
Notation is the same as for Table [s] 



ID 


R.A. 


Docl. 


"med 


9mad 


J-mcd 


Smcd 


Zmcd 


P-1 


Fr 


Ratio 


Pr 


Au 


Ag 


Ar 


Ai 


A^ 


1139404 


-24.140121 


0.369111 


17.71 


16.54 


16.54 


16.57 


16.62 


3.63296 


3.62848 


0.99877 


0.000 


0.33 


0.38 


0.28 


0.21 


0.18 


1945634 


-31.463604 


1.030967 


17.50 


16.32 


16.11 


16.01 


15.98 


1.53242 


1.53904 


0.99570 


0.035 


0.52 


0.58 


0.37 


0.27 


0.26 


2954798 


-41.387218 


-0.569949 


18.15 


16.95 


16.97 


17.01 


17.06 


3.15265 


3.15188 


0.99976 


0.001 


0.46 


0.53 


0.37 


0.30 


0.23 


3261654 


-45.603648 


0.149636 


17.01 


15.81 


15.83 


15.87 


15.93 


3.68724 


3.61116 


0.97937 


0.012 


0.22 


0.26 


0.18 


0.15 


0.11 


6058913 


-58.281094 


1.205592 


18.29 


17.13 


17.06 


17.08 


17.12 


2.55444 


2.55320 


0.99951 


0.040 


0.48 


0.50 


0.35 


0.30 


0.26 


6086465 


-57.509052 


-0.506643 


18.44 


17.26 


17.31 


17.38 


17.43 


3.63835 


3.71848 


0.97845 


0.010 


0.31 


0.32 


0.25 


0.21 


0.22 


700313 


15.188040 


-1.036745 


17.85 


16.72 


16.56 


16.50 


16.51 


2.99438 






0.603 


0.51 


0.66 


0.39 


0.30 


0.24 


1731088 


-30.677761 


-0.062001 


20.20 


19.14 


18.97 


18.96 


18.95 


1.96133 






0.357 


0.87 


0.91 


0.66 


0.51 


0.49 



were several RRab stars that did not have acceptable tem- 
plate fits for all or part of the bands, as the template set is 
not all-encompassing (Scsar ct al.^^20l6j |. Such light curves 
showed a sharper rising branch, a narrower peak and a de- 
creasing branch composed of a steep initial fading from the 
brightest state and a more prolonged near-minimum state 
than the templates. One example, the same star that is pre- 
sented in Fig. |A1| is shown in Fig. IS] The template ampli- 
tudes of these stars are underestimated on average by around 
10% (up to 20%); such cases are denoted in Table [2] by an 
asterisk following the amplitude value. For the RRab stars, 
we estimated heliocentric distance following Section 4.1.1 
of Sesar et al.| ( |2010[ ). We assumed the same average halo 
metallicity of [Fe/H] = -1.5 and a mean value My =0.6 for 
their absolute magnitude. The extended heliocentric map of 
the halo is presented in Fig. |9] The majority of the new 
variables is situated close to the Galactic bulge, continuing 
smoothly the already known structures towards the bulge. 

The multifrequency analysis, performed on all candi- 
dates with sufficient residual degrees of freedom, yielded 23 
further objects with multiple modes whose significance was 
confirmed by a combination of bootstrap and extreme- value 
procedures (though two with only very few residual degrees 
of freedom after smoothing), and two more showing clear 
light curve changes without indication of a second frequency. 
These are listed in Tables |3) |4] and [5] 

Table [S] contains the found double-mode RR Lyrae 
candidates showing the theoretically expected fundamental- 
overtone frequency ratio close to 0.745 (12 stars, around 2% 
of the total Stripe 82 RR Lyrae sample found in [Sesar et] 
al. 2010 and here). The summary information plot for one 
of them is presented in Fig. |A2[ These stars fit very well 
the expected relation between the fundamental period and 



the fundamental-overtone period ratio, as shown in Fig. |10[ 
The span of periods indicates a broad range of metallici- 
ties and/or masses to account adequately for the observed 
ranges of periods and ratios (see e.g. Alcock et al.|2000a l. In 
addition, TablelSllists two more objects for which the period 
search gave probably aliased results, and therefore the ra- 
tio is off for the frequencies corresponding to the maximum 
of the periodogram. However, supposing another alias to be 
the true frequency [Fr — 1 day^^ for the fundamental mode 
of object 1149344 and Fz^ + 1 day"'^ for the overtone in 
the case of 1283137), the stars are in the regions admissible 
for double-mode RR Lyrae variables on the colour-colour, 
colour-period and period-amplitude diagrams. With these 
frequency choices, they fit well also the relationship between 
the fundamental period and the period ratio, as shown by 
the empty circles in Fig. |10[ 

Table [4] contains RR Lyrae candidates with two closely 
separated frequencies and/or showing multiple thread struc- 
ture in their folded light curves, together with two others 
showing multiple threads without the presence of a resolved 
second frequency (an example is given in Fig. A3 1. Such 



properties may be due to the Blazhko effect, slow periodic or 
irregular modulations of the amplitude or of the period. The 
folded hght curve panels in Fig. |A3| indicate the Julian date 
of the observation with colours (the colour legend is given 
under the folded hght curves) , and the different threads visi- 
bly belong to different years of observations. For this object, 
the separation of the two frequencies is very small, suggest- 
ing possibly a long Blazhko period, the presence of period 
changes or phase shifts. We list two more objects (6058913 
and 6086465) with very small number of data where the 
smoothing left very few residual degrees of freedom, but the 
bootstrap indicated presence of secondary modes. More ob- 
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Table 5. Multiperiodic RR Lyrae candidates with unusual period ratios in Stripe 82. Notation is the same as for Tablels] 



ID 


R.A. 


Docl. 


^med 


9mcd 


rmcd 


imcd 


Zmad 


F-1 


Fr 


Ratio 


Pr 


A„ 


^9 


Ar 


Ai 


Aj 


1528004 


-29.845729 


-0.438433 


18.35 


17.19 


17.07 


17.05 


17.08 


3.05227 


5.03044 


0.60676 


0.020 


0.87 


0.92 


0.67 


0.53 


0.46 


3252839 


-46.822139 


-1.190310 


17.83 


16.69 


16.70 


16.75 


16.80 


3.21292 


5.15024 


0.62384 


0.001 


0.35 


0.36 


0.29 


0.20 


0.20 


3218459 


-47.774708 


0.387032 


17.23 


16.10 


16.13 


16.20 


16.27 


4.48794 


3.60676 


0.80366 


0.000 


0.15 


0.17 


0.12 


0.10 


0.08 



Double-mode RR Lyrae 




0.48 



0.50 0.52 

PO [days] 



0.54 



0.56 



Figure 10. Petersen diagram for the candidate double mode RR 
Lyrae variables. Black dots: candidates with well-identified fre- 
quency peaks, empty circles: aliased candidates, plotted here with 
the corrected periods. 



servations of these objects are needed before their type can 
be resolved. 

Several other RR Lyrae candidates showed two signif- 
icant peaks at neither closely spaced nor fundamental-first 
overtone frequencies. Objects with similar frequency ratios 

in OGLE data 



2010 



were recently found by Soszyhski et al. 

in the LMC and by|01ech fc Moskalik[ ( |2009| ) in oj Centauri. 
We present three of them in Table [5 for which the Monte 
Carlo assessment showed these secondary peaks significant. 
The first one, with its period ratio around 0.8, may be a 
candidate double-mode RR Lyrae pulsating in the first and 
second overtone (its summary information plots are given 
in the appendix. Fig. |A4[ ); the other two show period ratios 
around 0.6. These objects need further data for confirma- 
tion. 



4.2.3 HADS candidates 



Using the training set described in Section [4. 2. 1| the Ran- 
dom Forest procedure selected 163 candidates. This was re- 
duced by a subsequent visual inspection to 132 accepted 
HADS candidates, 109 monoperiodic and 23 multiperiodic 
stars. These variables often have frequency spectra with 
complex structure, usually several secondary peaks beside 
a highly dominant frequency. The significance of the sec- 
ondary peaks was much less prononunced in the multiperi- 
odic HADS sample than among the double-mode RR Lyraes. 
Table |6] presents a summary of the basic properties of the 
monoperiodic sample, similarly to Table IT] Two examples, 
one with symmetric and one with asymmetric folded light 
curve are presented in Figs. |A5] and |A6[ All frequencies given 
in Table |6] can be affected by aliasing, though this does not 



modify the classification of the stars, as the typical frequen- 
cies of the HADS are high, and the aliasing would hardly 
lead to RR Lyrae frequencies and a consequent misclassifi- 
cation. In the absence of metallicity estimates or spectro- 
scopic information, their further division into Population I 
[5 Scuti) and Population H (SX Phoenicis) objects is not 
possible. 

In this sample, we also found several suspected multi- 
periodic objects. We included a star into our multiperiodic 
set only if two conditions held: (i) the combined bootstrap 
and extreme- value methods furnished a P- value smaller than 
0.05 for the peak in the residual spectrum, (ii) the visual 
check of the folded residual light curve showed perceptible 
systematic variation rather than several outliers grouped by 
the folding. The properties of these stars are summarized in 
Table [7] We divided the table into two parts, the top half 
containing 12 objects with residual peak more significant 
than 0.01, and the bottom half another 11 with secondary 
peaks with P-value between 0.01 and 0.05. Among our mul- 
tiperiodic candidates, we observed a broad variety of ratios 
between the primary and the residual periods at various 
levels of significance: they range from around 0.65 to 0.86, 
in addition to a group at very high ratios (> 0.92), and 
one peculiar object with a ratio 0.19. The scatter of these 
values suggest a wide variety of objects of diverging types 
and with large differences in their characteristics such as the 
evolutionary state, mass, metal content or rotation. The ma- 
jority of the most significant secondary peaks seem to form 
close doublets with the main frequency. Such behaviour is 
quite frequent among low-amplitude 5 Scuti and SX Phoeni- 
cis stars (see e.g.|Poretti|2003[|Breg"er fc Bishof|2"002l|Mazur,| 



Krzeminski fc Thompson|2003 |. However, aliasing makes it 
difficult to find the true frequency in both principal com- 
ponent and residual spectra, and the presence of a strong 
secondary peak may be also the consequence of a not suf- 
ficiently precise primary frequency identification due to the 
relatively scarce number of observations per time series. 
A few stars of our sample exhibits signs of amplitude or 
phase/period changes, light curve threads separated by ob- 
servational year and showing a high dispersion of the peak 



values (the clearest example is 225151, shown in Fig. A7l 



5 DISCUSSION 

Multi-filter observations comprise much information about 
the characteristics of the observed star through the colour 
variations. We tested robust PCA as a way to extract infor- 
mation about these variations. We found that PCA produces 
several useful quantities: the proportion of the variance of 
the first principal component to the total variance, the time 
series of the first principal component, a robust notion of 
outlyingness, and the direction of the first principal compo- 
nent. 



14 M. Siiveges et al. 

Table 6. HADS candidates in Stripe 82. Notation is the same as for Table IT] 



ID 


R.A. 


Docl. 


"med 


am,sd 


rmcd 


imed 


^mad 


F.l 


A„ 


^3 


Ar 


Ai 


Aj 


558620 


-12.512990 


-0.604203 


20.50 


19.42 


19.46 


19.50 


19.56 


23.22362 


0.59 


0.69 


0.49 


0.34 


0.30 


516958 


-13.867340 


0.412679 


17.37 


16.40 


16.32 


16.37 


16.43 


17.68620 


0.63 


0.76 


0.53 


0.41 


0.35 


825896 


-16.982811 


-0.869013 


17.57 


16.41 


16.33 


16.37 


16.43 


20.17531 


0.19 


0.18 


0.14 


0.10 


0.08 


800977 


-18.396891 


1.189213 


18.21 


17.21 


17.17 


17.20 


17.26 


19.90575 


0.25 


0.30 


0.24 


0.18 


0.18 


799691 


-19.685526 


-0.304110 


20.58 


19.63 


19.61 


19.65 


19.67 


19.11502 


0.43 


0.59 


0.41 


0.30 


0.18 


407172 


-8.778812 


-0.076337 


20.63 


19.67 


19.64 


19.71 


19.79 


26.77966 


0.15 


0.23 


0.18 


0.14 


0.01 


16959 


0.818054 


1.074244 


20.52 


19.41 


19.45 


19.51 


19.49 


16.82779 


0.71 


0.71 


0.50 


0.42 


0.36 


173268 


10.194452 


-0.987410 


20.01 


19.02 


18.86 


18.83 


18.85 


16.12342 


0.18 


0.24 


0.14 


0.09 


0.06 


187850 


13.118868 


-0.016244 


20.79 


19.78 


19.70 


19.75 


19.79 


19.19475 


0.20 


0.26 


0.16 


0.10 


0.08 


610306 


17.691082 


0.312217 


20.60 


19.59 


19.57 


19.63 


19.69 


19.46323 


0.28 


0.26 


0.20 


0.14 


0.29 


695598 


18.277557 


-0.993741 


21.00 


19.92 


19.91 


19.97 


20.07 


20.49191 


0.06 


0.18 


0.13 


0.13 


0.07 



Table 7. Multiperiodic HADS stars in Stripe 82. Notation is the same as for Table Is] 



ID 


R.A. 


Dccl. 


"med 


9med 


r,„ed 


'mcd 


^med 


F.l 


Fr 


Ratio 


Pr 


Au 


A, 


Ar 


Ai 


Aj 


4936224 


58.031763 


0.502842 


17.71 


16.81 


16.81 


16.85 


16.94 


17.91900 


17.91776 


0.99993 


0.000 


0.24 


0.27 


0.20 


0.14 


0.13 


4433015 


-51.994003 


1.211038 


19.39 


18.37 


18.34 


18.38 


18.38 


15.86637 


16.08300 


0.98776 


0.008 


0.30 


0.26 


0.17 


0.11 


0.13 


2816955 


-44.450874 


0.835961 


16.95 


15.95 


15.97 


16.06 


16.14 


25.80457 


25.15836 


0.97496 


0.000 


0.19 


0.22 


0.16 


0.13 


0.11 


1113593 


-20.457095 


-0.775377 


17.81 


16.74 


16.74 


16.77 


16.86 


21.71685 


21.04152 


0.96890 


0.000 


0.23 


0.29 


0.21 


0.17 


0.14 


1741727 


-31.076033 


-0.079779 


20.06 


19.01 


18.95 


18.99 


19.05 


20.68204 


19.11928 


0.92444 


0.000 


0.02 


0.17 


0.12 


0.11 


0.10 


3642254 


-47.102871 


0.553778 


17.37 


16.30 


16.33 


16.40 


16.47 


21.67150 


18.87376 


0.88167 


0.001 


0.23 


0.30 


0.20 


0.16 


0.15 


5401947 


-55.894772 


0.624779 


16.67 


15.60 


15.62 


15.67 


15.78 


11.90623 


9.89844 


0.83137 


0.001 


0.61 


0.67 


0.49 


0.37 


0.33 


3269918 


-45.136352 


-1.230279 


18.67 


17.63 


17.68 


17.75 


17.84 


19.09286 


24.45904 


0.78061 


0.000 


0.46 


0.49 


0.37 


0.28 


0.24 


2196466 


-35.874290 


-0.357666 


19.32 


18.30 


18.19 


18.19 


18.23 


9.31067 


11.95100 


0.77907 


0.007 


0.17 


0.22 


0.14 


0.10 


0.09 


2777345 


-42.763135 


0.482715 


17.94 


16.93 


16.83 


16.84 


16.90 


18.57330 


24.30612 


0.78414 


0.003 


0.21 


0.24 


0.18 


0.12 


0.09 


2383752 


-37.431381 


-0.842875 


21.33 


20.31 


20.27 


20.30 


20.32 


13.15213 


9.90024 


0.75275 


0.009 


0.25 


0.40 


0.27 


0.20 


0.18 


5415273 


-57.074003 


-0.422064 


18.34 


17.30 


17.25 


17.30 


17.35 


14.23765 


10.29948 


0.72340 


0.000 


0.16 


0.19 


0.13 


0.10 


0.09 


713584 


-16.915188 


0.737187 


19.40 


18.37 


18.38 


18.41 


18.48 


20.72999 


20.92128 


0.99086 


0.015 


0.21 


0.26 


0.19 


0.15 


0.12 


2298258 


-36.580453 


0.002215 


20.43 


19.49 


19.50 


19.57 


19.62 


20.24135 


19.23308 


0.95019 


0.044 


0.39 


0.49 


0.36 


0.29 


0.24 


421829 


-12.173045 


0.426865 


20.79 


19.78 


19.84 


19.94 


20.05 


23.13966 


25.14084 


0.92040 


0.034 


0.16 


0.22 


0.17 


0.11 


0.07 


225151 


10.458869 


-0.877921 


17.94 


16.84 


16.87 


16.91 


16.98 


18.25562 


22.35908 


0.81647 


0.015 


0.45 


0.52 


0.37 


0.30 


0.25 


4064319 


5.074867 
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We found that the proportion of PCI variance to the 
total variance is a useful variability indicator. The PCI spec- 
trum can help detect cases when excess variance is due to 
underestimated errors, by having one dominant and four 
near-zero elements. The time series of the PCI yields the 
best attainable signal-to-noise ratio. Also, weights for period 
search can be defined based on a robust notion of outlying- 
ness. Simulations and a test on real data containing several 
known RR Lyrae stars confirmed that period search on the 
PCI time series with robust weights decreases the impact of 
aliasing on the results. 

The direction of the largest variation in the 5- 
dimensional point cloud, which is termed the PCI spectrum, 
proved to be useful in classification. For our sought sample of 
radially pulsating variables between 6500-7500 K, combined 
with bandwise SDSS error patterns, this direction points to- 
wards the g band, with decreasing contribution from r, i, u 
and z bands. It was used in the Random Forest classification 
procedure as a new attribute helping to separate pulsating 
variables from eclipsing binaries. The coefficients of the g 
and i bands ranked among the best five attributes, beside 
the period and the r — i and g — r colours. 

The study produced 132 HADS, 68 RRab, 36 RRc and 



25 multiperiodic or peculiar RR Lyrae variable stars. The 
RR Lyrae are new addition to the confirmed RR Lyrae list 



of Sesar et al. (20101. The vast majority of these new RR 
Lyrae stars were found in the regions not originally consid- 
ered by Sesar et al. (20101. Only a few type-ab RR Lyrae 



stars were missed by their work in the range considered by 
that study (308 deg < RA < 60 deg and |Dec| < 1.25), 
making their sample about 99% complete. Among the new 
multiperiodic RR Lyrae, there are 14 double-mode candi- 
dates, several others showing signs of Blazhko effect, period 
or phase change, and we found candidates with unusual pe- 
riod ratios, of which one may be a double-mode star pulsat- 
ing in the first and second overtone. Among the HADS can- 
didates, the multiperiodicity seems to show a broad variety 
of ratios between periods. The HADS candidates, similarly 
to the multi-mode RR Lyrae, need eventual observational 
follow-up to clarify and confirm their type and pulsation 
modes. 

This work yielded promising results about the utility of 
PCA, and opens the way to further improvements. Missing 
data reduce the number of observations in the time series of 
the first principal component to the number of data in the 
most scarcely observed band. Thus, though signal-to-noise 
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ratio improves with the application of PCA, the decrease 
in the number of observations in the time series can imply 
worse performance of period search. Optimization of the pe- 
riod search performance can be achieved either by restricting 
the analysis to the combination of only the best bands, or 
by a statistical methodology that is able to deal with the 
missing values. Further investigations will explore these in- 
teresting possibilities. 
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APPENDIX A: SUMMARY INFORMATION 
FIGURES FOR VISUAL CHECKS 

These figures summarize the most important features of a 
star that can be derived from 5-band photometric time se- 
ries. 

The two plots on the top left-hand side contain the fre- 
quency spectrum of the time series of zn, . . . , z\t and that 
of the residual PCI time series, with the found most signif- 
icant frequencies in each emphasized by an orange line. 

The three top panels in the middle column present the 
light curves of PCI scores, g-magnitudes and g ~ i colour 
folded with the main frequency, and the two top right pan- 
els show the light curves of the PCI and g residuals folded 
by the residual frequency. Spline smoothed lines are added 
to the PCI and p-band light curves. The folded light curves 
are colour-coded according to the Julian date of the obser- 
vation; this is particularly useful for detecting slow phase 
or amplitude changes. The colour code is given in a stripe 
under the residual light curves. 

In the bottom row, we show the raw time series of ob- 
servations, the (u — g)-{g — r), the {g — i)-logio (period) and 
the period-amplitude diagrams, and the PCI spectrum. The 
raw time series in the leftmost bottom panel show the u-, 
g-, r-, i- and 2-bands in blue, green, red, brown and black, 
respectively; time is given in Julian Dates. This panel is use- 
ful for detecting trends, shifts or eventual other deviations 
in the data that might cause problems in the period search. 
The second bottom panel is the (u — g)-(g — r) colour-colour 
diagram. Light grey points here represent the general vari- 
able sample obtained by the condition imposed on the vari- 
ance df of the first principal component and the cuts on the 
PCI spectrum vi (steps 1 and 2 of the preliminary selection 



procedure in Section 4.2.1 1. Black circles and dark grey dots 
refer to the RRab and the RRc sample of Sesar et al. (2010|, 



respectively. The orange blob represents the candidate star. 
In the [g — j)-logio (period) and the period-amplitude di- 
agrams (third and fourth panels in the bottom row), the 
light grey points represent only the candidate HADS and 
RR Lyrae variables (without distinction) that were selected 
by Random Forest. Black circles, dark grey dots and the 
orange blob have the same meaning as in the colour-colour 
diagram. For the amplitude-period diagram, we used the 
percentile- based estimate of the g light curve range as de- 
scribed in Section |4.2.1| The last plot, the PCI spectrum 
show the vi values of the object as an orange line versus the 
effective wavelengths (in A) of the filters. The two grey lines 
here give the lower and the upper boundary of the band oc- 
cupied by the 483 confirmed RR Lyrae stars of | Sesar et al.| 



(20101 
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Figure Al. An RRab candidate. Legend is given in the text of tiie Appendix. 
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Figure A2. A double-mode RR Lyrae candidate. Legend is given in the text of the Appendix. 
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Figure A3. An RR Lyrae candidate showing signs of phase and ampUtude shift and possible Blazhko behaviour. Legend is given in the 
text of the Appendix. 



20 M. Siiveges et al. 






- ' \i 




^ * V* 


1 


\ 


1 


*/ - 




y' ^ 




*'^ / 


} 


* /* 




* "**T 






f 








^ £ 
« ^ 






■ \* 




1 T 1 f 1 \ l-l 



(E IK 01 IE- or- 



1^ 

I 




f? 


,-^i^ 


^ 1 
1 1 


_=^- 




^^^- 


:j 


_ — ±^- 




- g 



O"! S"I1 DO 






mil mil 010- 



5 0- 1- 



-1 1 \ 1 1 1-^ 

5 fO ta I'd I'd O'O 



-I 1 \ — 

S'91 O'^l S'll 



Figure A4. A double-mode RR Lyrae candidate, possibly pulsating in the first and second overtone. Legend is given in the text of the 
Appendix. 
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Figure A5. A HADS candidate with symmetric light curve. Legend is given in the text of the Appendix. 
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Figure A6. A HADS candidate with asymmetric light curve. Legend is given in the text of the Appendix. 
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Figure A7. A double-mode HADS candidate with amplitude change indications. Legend is given in the text of the Appendix. 



